-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36311: [C++] Fix integer overflows in utf8_slice_codeunits
#36575
Conversation
@pitrou I haven't added any python tests (although I can if necessary) - but the original examples should be working now as well. >>> pa.compute.utf8_slice_codeunits(f"AB{chr(127917)}C{chr(127917)}ㇱD", start=2, stop=None, step=4)
<pyarrow.StringScalar: '🎭D'> >>> s = f"AB{chr(127917)}C{chr(127917)}ㇱD"
>>> a = pa.array([s])
>>> pc.utf8_slice_codeunits(a, start=0, stop=None)
<pyarrow.lib.StringArray object at 0x7f0fa4bb85e0>
[
"AB🎭C🎭ㇱD"
] |
The current Python tests already exercise the arrow/python/pyarrow/tests/test_compute.py Line 537 in a63ead7
|
This PR is almost ready for merging and would be nice to include in 13.0.0. @raulcd |
@github-actions crossbow submit -g cpp |
This comment was marked as outdated.
This comment was marked as outdated.
This may need more work, unfortunately. I've learned that passing >>> s = pa.scalar("𝑓öõḍš")
>>>
>>> pc.utf8_slice_codeunits(s, start=2, stop=None, step=-1)
<pyarrow.StringScalar: ''>
>>>
>>> s.as_py()[2:None:-1]
'õö𝑓' The current behavior is actually equivalent to this: >>> s.as_py()[2:sys.maxsize:-1]
'' So yeah... expanding the python tests was a good call. |
Also reorganizes C++ tests
When `stop=None`, negates the default value of `sys.maxsize` if `step < 0` to mimic the behavior of Python arrays
@github-actions crossbow submit -g cpp |
This comment was marked as outdated.
This comment was marked as outdated.
@github-actions crossbow submit -g cpp |
Revision: 58eb873 Submitted crossbow builds: ursacomputing/crossbow @ actions-4f1abc3b6c |
@raulcd It would be nice to cherry-pick this in 13.0.0. |
### Rationale for this change The default value for the `SliceOptions::stop` is `INT64_MAX`, which isn't considered in several internal calculations - resulting in integer overflows and unexpected behavior when `stop` isn't provided. Also note that running the included tests without the fixes should result in ubsan errors (it did for me, at least). ### What changes are included in this PR? - Adds some logic to `SliceCodunitsTransform` that handles potential overflows - Adds tests for cases where the `start` param is positive/negative and `stop` is the maximum value **Update** Discovered that `utf8_slice_codeunits` deviates from Python array behavior when `stop=None` and `step < 0`, so further changes were made: - Handles `INT64_MIN` for `SliceOptions::stop` on C++ side, adds more tests. - Updates Python bindings for `SliceOptions` so that the default value when `stop=None` (`sys.maxsize`) is negated when `step < 0` - Adds `None` as a possible `stop` value in Python tests ### Are these changes tested? Yes (tests are included) ### Are there any user-facing changes? In theory, altering the behavior of `utf8_slice_codepoints` when `stop=None` and `step < 0` could be considered a breaking change. That being said, the current implementation produces incorrect results whenever `None` is even used, so it probably isn't one in practice... * Closes: #36311 Authored-by: benibus <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 143c691. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
It would also be nice to backport this to 12.0.0 if possible, since right now we can't upgrade to 13.0.0 (see pandas-dev/pandas#55374 (comment)). |
Rationale for this change
The default value for the
SliceOptions::stop
isINT64_MAX
, which isn't considered in several internal calculations - resulting in integer overflows and unexpected behavior whenstop
isn't provided.Also note that running the included tests without the fixes should result in ubsan errors (it did for me, at least).
What changes are included in this PR?
SliceCodunitsTransform
that handles potential overflowsstart
param is positive/negative andstop
is the maximum valueUpdate
Discovered that
utf8_slice_codeunits
deviates from Python array behavior whenstop=None
andstep < 0
, so further changes were made:INT64_MIN
forSliceOptions::stop
on C++ side, adds more tests.SliceOptions
so that the default value whenstop=None
(sys.maxsize
) is negated whenstep < 0
None
as a possiblestop
value in Python testsAre these changes tested?
Yes (tests are included)
Are there any user-facing changes?
In theory, altering the behavior of
utf8_slice_codepoints
whenstop=None
andstep < 0
could be considered a breaking change. That being said, the current implementation produces incorrect results wheneverNone
is even used, so it probably isn't one in practice...